Data clustering and noise undressing of correlation matrices 



Matteo Marsili 

Istituto Nazionale per la Fisica della Materia (INFM), Trieste- SIS S A Unit, V. Beirut 2-4, Trieste 1-34014 

(February 1, 2008) 

We discuss a new approach to data clustering. We find that maximum likelyhood leads naturally to 
an Hamiltonian of Potts variables which depends on the correlation matrix and whose low temper- 
ature behavior describes the correlation structure of the data. For random, uncorrelated data sets 
no correlation structure emerges. On the other hand for data sets with a built-in cluster structure, 
the method is able to detect and recover efficiently that structure. Finally we apply the method to 
financial time series, where the low temperature behavior reveals a non trivial clustering. 



Statistical mechanics typically addresses the question 
of how structures and order arising from interactions in 
extended systems are dressed, and eventually destroyed, 
by stochastic - so-called thermal - fluctuations. The in- 
verse problem, unraveling the structure of correlations 
from stochastic fluctuations in large data sets, has re- 
cently been addressed using ideas of statisical mechanics 
H|). This is the case of data clustering problems, where 
the goal is to classify N objects, defined by D dimensional 
vectors {£i}£Li, in equivalence classes. Generally jlj the 
idea is 1) postulate a cost function, which depends on 
the data sample, for each structure and 2) consider the 
cost function as an Hamiltonian and study its thermal 
properties. Structures are identified by configurations 
S = {si}fL 1 of class indices, where Sj is the equivalence 
class to which object % belongs. Regarding Sj as Potts 
spins, a Potts Hamiltonian H q = X^<j Ji.j^st.sj has been 
recently proposed as a cost function, with couplings 



Ji j decreasing with the distance di 
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tween objects i and j. The underlying structure of data 
sets emerges as the clustering of Potts variables at low 
temperatures. 

In this work we address the question of data cluster- 
ing for time series. Rather than postulating the form 
of the Hamiltonian, we start from a statistical ansatz 
and invoke maximum likelyhood and maximum entropy 
principles. In this way, the structure of the Hamiltonian 
arises naturally from the statistical ansatz, without the 
need of assumptions on its form. We study, by Monte- 
carlo method, this Hamiltonian for artificial time series: 
If time series are generated with some cluster structure 
S* , we find a low temperature phase which is dominated 
by cluster configurations close to S*. For random time 
series no low temperature phase is found. We also study 
time series of assets composing the S&P500 index, whose 
correlations have been the subject of much recent inter- 
est [|]-[| . Correlation matrices of financial time series are 
of great practical interest. Indeed they are at the basis 
of risk minimization in the modern portfolio theory Q . 
This states that, in order to reduce risk, the investment 
needs to be diversified (i.e. divided) on many uncorre- 
lated assets. However the measure of correlation in finite 
samples was recently found to be affected by considerable 



noise-dressing |3|. 

Our aim is to address the problem of revealing the 
structure of bare correlations hidden in a finite data set. 
Quite interestingly, our analysis of the S&P500 data set 
reveals a low temperature behavior dominated by few 
clusters of correlated assets with scale invariant proper- 
ties. The thermal average over the relevant cluster struc- 
tures provides a good fit of the financial correlations, 
which allows us to estimate the noise-undressed corre- 
lation matrix. Finally, we discuss several generalizations 
of our approach to generic data clustering. 

The data 5 = is composed of N sets & — 

{^i(d)}2=i of D measures. These are normalized to zero 
mean Yld&(d)/D = and unit variance ^2d£i(d)/D = 
1. We focus below on the case where £»(d) is the normal- 
ized daily returns of asset i of the S&P500 index, in day 
d 0. For the moment being, let us assume that &(<£) 
are Gaussian variables. The reason is that we want to 
focus exclusively on pairwise correlations and the Gaus- 
sian model is the only one which is completely specified 
at this level. We shall discuss below how to apply the 
method when £i(d) are not Gaussian. The key quantity 
of interest is the matrix 
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The spectral properties of this matrix, for uncorrelated 
time series, are known exactly [0. The spectrum of eigen- 
values A extends over an interval of size ~ N/D around 
A = 1, as shown in Fig. |]a. The spectrum of eigenval- 
ues of the S&P500 correlation matrix is also shown. The 
similarity of the two distributions for A ~ 1 suggests that 
significant noise-dressing due to finite D occurs ||. The 
tail of the distribution (A > 1) implies that some cor- 
relation is however present. The structure of correlation 
was analyzed both by minimal spanning tree method 
and by the method of ref. ||] in ref. ||. 

In order to explain this correlation, Noh || proposed 
the ansatz 



/g7~r) Si (d) + ei(d) 
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Here g s > and s, are integer variables (so-called Potts 
spins) , T) s (d) and (d) are iid gaussian variables with zero 
average and unit variance. In order to allow for totally 
uncorrelated sets, we allow Sj to take all integer values 
up to AT. In Eq. (^) sets are correlated in clusters labeled 
by s. The s th cluster is composed of n s sets with internal 
correlation c«, where 
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The correlation matrix generated by Eq. (Q) for D — > oo 
is Cjj = {g Si o~si,sj + $i,j)/0- + 5si)- Its distribution of 
eigenvalues is simple: To each s with n s > 1 there cor- 
respond one eigenvalue A s .o = (1 + g s n s )/(\ + g s ) and 
n s — 1 eigenvalues A s ,i = 1/(1 + g s ) Hence, large eigen- 
values correspond to groups of many (n s ^> 1) sets. For 
D finite, we expect noise to lift degeneracies between A s> i 
but to leave the structure of large eigenvalues unchanged. 
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FIG. 1. a) Distribution of eigenvalues of the correlation 
matrices of S&P500 (full line •) random (dotted x) and cor- 
related (dashed o) time series, b) Comparison of the spec- 
trum of the S&P500 correlation matrix (full line •) with 
noise-dressed (dotted □) and bare (dashed +) correlation ma- 
trices generated by Eq. ([?]). 

In order to fit the data set S with Eq. (0) , let us com- 
pute the likelyhood. This is the probability PCE\S, Q) 
of observing the data 3 as a realisation of Eq. (g) with 
structure S and parameters Q = {g s }^Li, and it reads 
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p(z\s,g) = Hl[{s &(d) 
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where the average is over all the 77's and e's variables and 
S(x) is Dirac's delta function. Gaussian integration leads 
to P(E\S,Q) oc e - DH ^ s ^ with H{S,G} = §£,[(1 + 
9s) K " ifi^n) ~ n * M 1 + 9s) + ln(l +g s n s )]. We fix the 



coupling strengths g s by likelyhood maximization = 
for all s, which yields 



(4) 



for n s > 1 and g s = for n s < 1. Note that for un- 
correlated sets Cjj = <5ij we have c s = n s Vs and hence 
g s = 0. The coupling strength g s instead diverges for 
totally correlated sets (dj = 1) because c s = n 2 s . Using 
Eq. (Q) we find that the likelyhood of structure S under 
ansatz (0) takes the form P(S|6>) oc e~ DHc , where 



Hc{S} = \ J2 



s:n B >0 



log — + (n s - 1) log ■ 
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The ground state Sq of H c yields the maximum likely- 
hood fit with Eq. (Q). This would probably take the 
ansatz (|J) too seriously. In general, it is preferrable 
to consider probabilistic solutions P{S} and, following 
ref. JjJ, we invoke the the maximum entropy principle: 
Among all distributions with the same average 

log-likelyhood, we select that which has maximal en- 
tropy. This, as usual, leads to the Gibbs distribution 
P{S} oc e~P Hc { s > where the inverse temperature (3 arises 
as a Lagrange multiplier. 

The Hamiltonian H c depends implicitcly on the Potts 
spins Si through the cluster variables n s and c s of Eq. 
(^). Unlike the Potts Hamiltonian H q , the dependence 
on S SilS . is non-linear and it is modulated by C\j. For 



7^ Sj for all i 7^ j we have n s 



1 for all s and 



hence H c = 0. This state is representative of the high 
temperature (/3 — > 0) limit. The low temperature physics 
of H c is instead non-trivially related to the correlation 
matrix d j . Note, that the ferromagnetic state Si = l,Vi, 
which dominates as — > 00 in clustering methods based 
on Potts models ||, is in general not the ground state 
of H c . Intuitively we expect that, if the model of Eq. 
(U) is reasonable, H c should have a well defined ground 
state and low temperature phase which is energetically 
dominated by this state. In these cases, as in ref. 0, we 
expect a thermal phase transition fl(i|| . 

In order to study the properties of H c we resort to 
Montecarlo (MC) method by Metropolis algorithm |jll| . 
This, at equilibration, allows us to sample the Gibbs dis- 
tribution -P{5} and compute average quantities, such as 
the internal energy E@ = {H c )p where (■ ■ -)p stands for 
thermal average. In order to detect the occurrence of 
spontaneous magnetization - which occurs if the Si lock 
into energetically favourable configurations at low tem- 
perature - we measure the autocorrelation function 



X(t,r) 



J2i<j $ 3j (t), Sj (t) $8i (t+r),aj (t+r) 
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This quantity tells us how many pairs of sites belonging 
to the same cluster at time t are still found in the same 
cluster after r MC steps. For t large enough, \ becomes 
a function of r only. This function decreases rapidly to 
a plateau value xp — (x(^ T ))/3 f° r t ^> t ^> 1. Clearly 
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Xfj — implies that no persistent structure is present 
whereas, at the other extreme, XfJ — 1 implies that all 
sites are locked in a persistent structure of clusters. 
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FIG. 2. a) Energy Ep as a function of /3 for random (x), 
S&P500 (+) and correlated (□) data sets of length D = 1599 
respectively. The results for the S&P500 data set over the 
last D — 400 days are also shown (•). Inset: square energy 
fluctuation SEp vs /3 for the same data sets (same symbols), 
b) Autocorrelation \/3 as a function of (3 for the same data sets 
(same symbols). The full (dashed) line refers to the overlap 
with the configuration s* for the S&P500 (correlated) data 
set with D = 1599. 

We monitored these quantities for three different data 
sets all composed by N — 443 time series: 1) uncorre- 
cted time series; 2) the time series of daily returns of the 
assets composing the S&P500 index J7|]l2"| 3) correlated 
time series generated by Eq. (^J) with given Si — s* and 
9s = 9* s - The first and the third data sets serve to test 
the method in cases where we know the answer. 

Let us start with a truly uncorrelated time series with 
D = 1599. We compute dj and study the corresponding 
Hamiltonian H c by the MC method. We do not expect 
any clustering to emerge in this case. Indeed, the internal 
energy Ep stays very close to (see Fig. ||a) for all 
values of (3 investigated up to (3 = 5 12. Correspondingly 
no persistent cluster arises, i.e. X/3 — 0- 

The results change turning to correlated data. Let us 
first discuss the S&P500 data for D = 1599: As Fig. 
||a shows, for (3 w 20 the energy Ep starts deviating 
significantly from zero. For (3 > 20 persistent clusters 
are present: x/3 rapidly raises from zero and it has a 
maximum at (3 w 40 (see Fig. |]b). The energy fluctu- 
ations reported in the inset shows a broad peak of in- 
tensity marking the onset of an ordered low temperature 
phase. As (3 increases the dynamics is significantly slowed 
down. At (3 w 200 the energy reaches a minimal value 



Ep ~ —0.117V and does not decrease significantly increas- 
ing (3 at least up to (3 — 4095. This energy is smaller than 
that of the ferromagnetic state (Ef = —0.086N), with all 
sets in the same cluster. The system in this range of tem- 
peratures visits only few configurations. 

The statistical properties of cluster configurations, as 
(3 varies, are shown in Fig. ^. For small j3 only small 
clusters survive to thermal fluctuations. As (3 increases 
a distribution of cluster sizes develops. At low tempera- 
tures the rank order plot of n s reveals a broad distribu- 
tion of clusters with the largest aggregating more than 
190 sets. By a power law fit of this distribution, we find 
that the number of clusters with more than n sets decays 
as rt~ - 83 . The scatter plot of c s versus n s also reveals a 
non-trivial power law dependence c s ~ nl 66 . This gives 
a statistical characterization of the dominant configura- 
tions of clusters at low energy jl3| . 
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FIG. 3. Rank plot of n s for several values of (3. The line 
corresponds to n ~ rank" 1,2 . Inset: c s versus n B for f3 = 256 
(•) and (3 (□). The line corresponds to c/simn 1 ' 66 . 

For D = 400 we find two transitions at f3\ w 20 and 
at fa ~ 80 which are signalled by bending in the Ep 
curve and by peaks in the SEg vs (3 plot. At the first 
temperature clusters start to appear. For (3 < (32 the 
largest cluster groups less than 30 sets and for (3 > P2 
larger clusters n s w 100 appear. This hints at a time 
dependence of correlations, which are averaged in the 
D = 1599 data set. For even shorter time series we found 
that sampling errors, acting like a temperature, destroy 
large clusters and only relatively small clusters (n s < 40 
for D = 60) were found. 

We build a syntetic correlated data set of D = 1599 
points using Eq. (||) with a structure S — S* and param- 
eters G = G* ■ The structure S* is a typical low energy 
configuration for the S&P500 data set for D = 1599. The 
parameters g* where deduced from the n s and c s of this 
configuration, via Eq. (|j). The distribution of eigenval- 
ues of Ci.j is shown in Fig. [j]a. This data set is useful for 
at least two reasons: first it allows one to understand to 
what extent a struture of correlation put by hand with 
the form dictated by Eq. (0) can be correctly recovered. 
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Secondly it allows us to compare the results found for the 
S&P500 data with those of a time series with correlations 
described by Eq. (||), of a similar nature. 

For (3 < 150, the behaviors of Ep, 5Ep and xp are 
similar to those found for the S&P500 data (see Fig. |^). 
A second, sharp peak in 5Ep at (3 s» 170 signals a new 
clustering phase transition. Below this temperature, as 
shown by the plot of x/3 (%■ ||b), the MC dynamics 
freezes into the original structure S*. The overlap with 
the configuration <S*, defined as in Eq. (Q) as the fraction 
of "bonds" Si = Sj for which s* — s*, quickly converges 
to 1 (see fig. ||d) for the syntetic time series, whereas 
it remains around 60% for the S&P500 data set. This, 
on one hand means that the original structure S* can 
be recovered quite efficiently. On the other hand, it sug- 
gests that several cluster configuration compete at low 
temperatures in the S&P500 data set. 

Eq. (^|) with a single cluster configuration (/3 — > oo), 
is inadequate to capture the full complexity of the corre- 
lations in the S&P500 data set. Probabilistic clustering, 
where several cluster structures S are allowed with their 
Gibbs probability -P{5} (and finite (3) provides a much 
better approximation. At finite [3 each set i may belong 
to several clusters and we can measure the correspond- 
ing coupling strenght g s j((3) = (g s 5 s-Si )p. Taking these 
as the parameters of the generalized model 



relation c we have r 



-[tan" 

7T L 



tan 



6(d) 



EsV9MVs(d)+ei(d) 



(J) 



we can build an artificial time series & and compute the 
correlation matrix c\^{D). Here (3 is a free parameter 
which can be adjusted to "fit" the S&P500 correlation 
matrix. The eigenvalue spectra of the two matrices are 
compared in fig. [j]b for (3 = 48. The value of [3 was 
choosen by visual inspection as that giving the best fit. 
The curves are remarkably close, suggesting that Eq. (0) 
provides a good statistical description of the correlations 
among assets. Fig. [j]b also shows the noise undressed 
matrix (oo), which allows one to appreciate the effect 
of noise dressing. As expected, noise mainly affects small 
eigenvalues. 

The applicability of the method can be extended con- 
siderably to a generic data set {xi}fL 1 . Xi need not 
be a time series. The distribution of Xi(d) need not 
be Gaussian and it does not even need to be the same 
across i. For example, Xi(d) may be the measure 
of the d th feature of the i th object or the concentra- 
tion of species i in the d th sample of an experiment. 
The idea is to map the data set Xi into a Gaussian 
time series £j to which we apply Eq. (^|). The map- 
ping results from requiring that non-parametric cross- 
rfj are preserved. To do this in prac- 
for the Xi data sets: 
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Inverting this relation, we find the correlation u — 
as a function of r = rf ■ — rf„-. This allows us to build 
the Hamiltonian which can then be studied. 

With respect to ref. Q , our approach does not need any 
assumption on the form of the Hamiltonian. As input, 
the method only needs the correlation matrix (or 
Ty). The range of interactions is set by the correlations 
themselves. For small D, the local interaction of ref. Q 
may well be more efficient in capturing the structure of 
data. Our method is most useful in cases where D ~ 
N ^> 1. These ideas can clearly be extended to models 
of correlations different from Eq. @ . 

I acknowledge R. Zecchina, R. Pastor-Satorras and L. 
Giada for interesting discussions and R. N. Mantegna for 
providing the S&P500 data. 
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outcome of a statistical fit. In the cases discussed here, 
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